
Allow keeping functional test datasets #3418

Merged: 3 commits into distributed-system-analysis:main from pop, May 17, 2023

Conversation

@dbutenhof (Member) commented May 11, 2023

PBENCH-1148

Allow disabling the `test_delete_all` test by setting the `KEEP_DATASETS` environment variable ... along with some squirrelly mechanism to get this set in the proper context. (Oddly, adding it to the `tox.ini` "safe" list wasn't enough to allow setting it from outside; but we're already passing the server address as a command-line parameter, so I added a second.)

Ultimately, this translates to a simple `jenkins/runlocal --keep` command.
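Roughly, the plumbing has this shape; a minimal sketch, assuming hypothetical option handling (only `KEEP_DATASETS`, `tox.ini`, and the script names come from this PR):

```bash
# Hypothetical sketch only: the option parsing and names other than
# KEEP_DATASETS are assumptions, not the actual pbench scripts.

# Wrapper side (e.g. jenkins/runlocal): turn --keep into a variable.
keep=""
if [[ "$1" == "--keep" ]]; then
    keep=1
    shift
fi

# Because adding KEEP_DATASETS to the tox.ini "safe" list wasn't
# enough on its own, forward it the way the server address is already
# forwarded: as a positional command-line parameter.
server="$1"
./exec-tests "${server}" "${keep}"
```

On the far side, the test runner can re-export `KEEP_DATASETS` in the context where pytest actually runs, so the deletion test sees it and skips itself.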

@webbnh (Member) left a comment

Aside from some other small issues, isn't this change going to disrupt the ability to run individual tests?

@webbnh previously approved these changes May 15, 2023

@webbnh (Member) left a comment

Just got a question.

@webbnh (Member) left a comment

👍

@dbutenhof (Member, Author) commented

Well, this is interesting. Wonder if it'll recur ... ❔

Trying to pull quay.io/pbench/pbench-ci-fedora:main...
Error: initializing source docker://quay.io/pbench/pbench-ci-fedora:main: Requesting bearer token: invalid status code from registry 500 (Internal Server Error)

@webbnh (Member) commented May 15, 2023

> this is interesting

Yeah, and, despite that, we managed to get a journalctl dump... I wonder what it dumped?

@dbutenhof (Member, Author) commented

> > this is interesting
>
> Yeah, and, despite that, we managed to get a journalctl dump... I wonder what it dumped?

No, I think the failure was in the `jenkins/run` to fire up the functional tests; the server container and the backend pod appear to be up and running just fine.

@webbnh (Member) commented May 15, 2023

Mr. Jenkins managed to pull that container twice earlier in the job, but each time he pulls a fresh image. 😒

Google says that error indicates a disconnect between Quay and its back-end database.

Looking at the journalctl dump, it looks to me like the Server container started fine, so the trouble is with the container pulled to run the tests in.

@dbutenhof (Member, Author) commented

And now it fails the legacy test-51. This PR is haunted!

@webbnh (Member) commented May 15, 2023

> each time he pulls a fresh image. 😒

...and, this is by design, sort of. That is, we set `EXTRA_PODMAN_SWITCHES="--pull=always ..."` in the environment in the CI pipeline definition. Our intention (I think) was to make sure that we didn't use a stale container... but, now that we invoke `jenkins/run` several times in the job, I don't think we really want to re-pull the container each time.

We should probably change that to `--pull=newer`.... (In addition to saving us from redundant downloads, I think it might get us past this 500 error, too.)
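For context, the two podman pull policies in question behave like this; the `podman run` lines are illustrative (only the image name comes from the failure above):

```bash
# --pull=always: contact the registry and re-download the image on
# every invocation, even when a cached copy exists.
podman run --rm --pull=always quay.io/pbench/pbench-ci-fedora:main true

# --pull=newer: pull only if the registry image is newer than the
# locally cached one; otherwise run from the cache.
podman run --rm --pull=newer quay.io/pbench/pbench-ci-fedora:main true
```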

@dbutenhof (Member, Author) commented

> > each time he pulls a fresh image. 😒
>
> ...and, this is by design, sort of. That is, we set `EXTRA_PODMAN_SWITCHES="--pull=always ..."` in the environment in the CI pipeline definition. Our intention (I think) was to make sure that we didn't use a stale container... but, now that we invoke `jenkins/run` several times in the job, I don't think we really want to re-pull the container each time.

More efficient, to be sure; but in one sense that just makes the exposure points for whatever happened here less predictable. (I.e., whenever an executor finds a new `pbench-ci-fedora` image.)

> We should probably change that to `--pull=newer`.... (In addition to saving us from redundant downloads, I think it might get us past this 500 error, too.)

Yeah, well, it's particularly intriguing that it's revealed as a 500 error, which suggests that something triggered a Quay "loophole".

@webbnh (Member) commented May 16, 2023

> that just makes the exposure points for whatever happened here less predictable

Actually, I don't think it diminishes the predictability, but I think you're right that it doesn't necessarily improve it, either -- it has almost the same failure point. (That is, currently we would fail any time we couldn't pull the image; `newer` only takes action if the upstream image is different from the local one, and it is supposed to ignore pull errors, although I don't know whether that includes 500 errors....)

What we should do is pull the image at the start of the run (and fail early, if possible) and then use our cached image for the rest of the run, as sketched below. That might not be too hard to implement. (The problem is figuring out what "at the start of the run" means, across the cases of individual developer and CI, and from the `Pipeline.gy` file all the way down to `run-server-func-tests`, the RPM build, and `runlocal`....)
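A minimal sketch of that idea, assuming the CI wrapper can run a one-time setup step (`EXTRA_PODMAN_SWITCHES` and the image name come from this thread; everything else is illustrative):

```bash
# Pull exactly once, failing the whole run early if the registry is
# unhealthy (e.g. the 500 error seen above).
podman pull quay.io/pbench/pbench-ci-fedora:main || {
    echo "pbench-ci-fedora pull failed; aborting run" >&2
    exit 1
}

# Every later container invocation in the same run then uses the
# cached image rather than going back to the registry.
export EXTRA_PODMAN_SWITCHES="--pull=never"
```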

> something triggered a Quay "loophole".

I think it was a transient on their end, especially since it hasn't recurred.

@dbutenhof (Member, Author) commented

> I think it was a transient on their end, especially since it hasn't recurred.

Ah; so the failure in a critical network service was transient. Well, that's OK then. 🙉

@webbnh (Member) commented May 16, 2023

> that's OK then

Not so much "OK" as "beyond my control" (and, I expect, beyond my ability to do very much about).

But, if we didn't depend on pulling the container on every invocation, it might help.... 🥴

@dbutenhof merged commit 5d3ac3c into distributed-system-analysis:main May 17, 2023
@dbutenhof deleted the pop branch May 17, 2023 11:37